Entity Matching: How Similar Is Similar

Authors

  • Jiannan Wang
  • Guoliang Li
  • Jeffrey Xu Yu
  • Jianhua Feng
Abstract

Entity matching, which finds records that refer to the same entity, is an important operation in data cleaning and integration. Existing studies usually use a given similarity function to quantify the similarity of records and focus on devising index structures and algorithms for efficient entity matching. However, defining “how similar is similar” is a big challenge in real applications, because it is hard to select appropriate similarity functions automatically. In this paper we attempt to address this problem. Because there are many similarity functions and, even worse, a threshold may take infinitely many values, finding appropriate similarity functions and thresholds is expensive. Fortunately, we observe that different similarity functions and thresholds are redundant with one another, which gives us an opportunity to prune inappropriate similarity functions. To this end, we propose effective optimization techniques to eliminate such redundancy and devise efficient algorithms to find the best similarity functions. Experimental results on both real and synthetic datasets show that our method achieves high accuracy and outperforms the baseline algorithms.
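The search problem described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the two similarity functions, the F1 objective, and the helper names (`jaccard`, `bigram_dice`, `best_rule`) are assumptions chosen for illustration. The sketch relies on one observation that holds for any finite set of labeled pairs: a threshold that falls between two consecutive observed similarity scores yields the same predictions as the higher of the two, so only the observed scores need to be tried even though thresholds are real-valued.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over lowercase character bigrams."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = grams(a), grams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0


def f1(predicted, actual) -> float:
    """F1 score of boolean predictions against boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_rule(pairs, labels, functions):
    """Return (f1, function name, threshold) for the best single rule.

    A rule predicts "match" when sim(a, b) >= threshold. Only scores that
    actually occur on the labeled pairs are tried as thresholds, because
    any other threshold duplicates the predictions of one of them.
    """
    best = (-1.0, None, None)
    for name, sim in functions.items():
        scores = [sim(a, b) for a, b in pairs]
        for threshold in sorted(set(scores)):
            predictions = [s >= threshold for s in scores]
            quality = f1(predictions, labels)
            if quality > best[0]:
                best = (quality, name, threshold)
    return best


# Hypothetical labeled examples: True means the two records match.
pairs = [
    ("apple iphone 4", "iphone 4 apple"),
    ("sony tv", "sony television"),
    ("dell laptop", "hp printer"),
    ("canon camera", "canon printer"),
]
labels = [True, True, False, False]
functions = {"jaccard": jaccard, "bigram_dice": bigram_dice}

print(best_rule(pairs, labels, functions))
```

Restricting thresholds to observed scores is the simplest form of the redundancy the abstract mentions; the pruning of whole similarity functions that the paper proposes is not reproduced here.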

Similar resources

The fuzzy logic in air pollution forecasting model

In the paper a model to predict the concentrations of particulate matter PM10, PM2.5, SO2, NO, CO and O3 for a chosen number of hours forward is proposed. The method requires historical data for a large number of points in time, particularly weather forecast data, actual weather data and pollution data. The idea is that by matching forecast data with similar forecast data in the historical data...

Septate uterus with cervical duplication: the second report in the world

Mullerian anomalies are among the interesting but uncommon entities that gynecologists confront. The incidence is 1-6%. It is difficult to anticipate the real incidence, because most of the information is obtained from infertile or complicated patients with inadequate work-up. Recently, endoscopic procedures have revealed more details about these anomalies. Today, classification of Buttram & Gibbons (modi...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

EMBench: Generating Entity-Related Benchmark Data

The entity matching task aims at identifying whether instances refer to the same real-world entity. It is considered a fundamental task in data integration and cleaning techniques. More recently, the entity matching task has also become a vital part of techniques focusing on entity search and entity evolution. Unfortunately, the existing data sets and benchmarking systems are not abl...

A Block-Grouping Method for Image Denoising by Block Matching and 3-D Transform Filtering

Image denoising by block matching and three-dimensional transform filtering (BM3D) is a two-step state-of-the-art algorithm that uses the redundancy of similar blocks in a noisy image to remove noise. Similar blocks, which can have some overlap, are found by a block matching method and grouped to make 3-D blocks for 3-D transform filtering. In this paper we propose a new block grouping algorithm in th...

Journal:
  • PVLDB

Volume 4

Publication date: 2011